Exploring the latent space of variational autoencoders using Gaussian Heatmaps

Introduction

Since their introduction by Kingma and Welling (2013), variational autoencoders (VAEs) have become one of the most popular approaches to unsupervised learning of high-dimensional distributions. Broadly, a VAE takes examples $X$ that follow a distribution $P_X$ and learns a model $P$ from which we can sample, under the restriction that $P$ must be close to $P_X$ in a probabilistic sense. This restriction guarantees that the VAE works as a generative model: once $P$ is known, we can sample new observations that look like the original ones but are not identical to them.

The VAE overcomes several problems of older generative methods. Most notably, it doesn't rely on expensive computational inference procedures, doesn't impose a rigid a priori structure on the data, and doesn't make severe approximations. These advantages have driven the method's extensive adoption in the literature.

The architecture of the VAE is composed of an encoder and a decoder. The encoder maps the original input points to a lower-dimensional latent space. In this latent space a sample is taken close to each input (according to an estimated variance) and is transformed back to the original high-dimensional space by the decoder. For this to work, the model is encouraged to create a latent space that is dense, in the sense that points are projected near each other. Hence, directions in this space become meaningful, as they represent variations in the features of the original data. This is in sharp contrast to the results obtained with regular dimension-reduction methods, and in particular with autoencoder architectures.

For this project I decided to explore the latent space generated by the VAE for a particular data example. This is useful as it shows why this model has become so important for generative modelling and gives an idea of the difference from traditional autoencoder methods in terms of how the latent space is organized and whether it is meaningful. The data used is inherently three dimensional, but the input is a 36x36 image (see details in the next section). The main questions that I want to address are:

Data

The data used is generated by fixing a $2\times2$ covariance matrix and then taking a bivariate Gaussian distribution centered at $(0,0)$ with the chosen covariance structure. The density function is evaluated on an evenly spaced grid from $-3$ to $3$ composed of $36$ points per dimension. The resulting $36\times36$ array is max-normalized and can be represented as a heatmap identifying the underlying distribution for the chosen covariance.
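The generation of a single heatmap can be sketched as follows; this is a minimal numpy version of the step just described, with function and parameter names of my own choosing:

```python
import numpy as np

def gaussian_heatmap(var_x, var_y, rho, n=36, lim=3.0):
    """Evaluate a centred bivariate Gaussian density on an n x n grid
    over [-lim, lim]^2 and max-normalise the result."""
    xs = np.linspace(-lim, lim, n)
    X, Y = np.meshgrid(xs, xs)
    cov_xy = rho * np.sqrt(var_x * var_y)
    cov = np.array([[var_x, cov_xy],
                    [cov_xy, var_y]])
    inv = np.linalg.inv(cov)
    det = np.linalg.det(cov)
    # Quadratic form (x, y) Sigma^{-1} (x, y)^T evaluated pointwise on the grid.
    q = inv[0, 0] * X**2 + 2 * inv[0, 1] * X * Y + inv[1, 1] * Y**2
    density = np.exp(-0.5 * q) / (2 * np.pi * np.sqrt(det))
    return density / density.max()  # max-normalised heatmap

heatmap = gaussian_heatmap(var_x=1.0, var_y=2.0, rho=0.5)
```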

The data is inherently three dimensional by virtue of the symmetry in the matrix. Hence, the parametrization can be done based on only two variances and a correlation. These parameters translate into visual properties of the heatmap. A relatively higher variance in the first component (say $x$) makes the elliptical contours of the Gaussian wider, while a higher variance in the second component ($y$) translates into a longer contour. The correlation induces rotations of the ellipse.

In total $11{,}875$ heatmaps are generated by taking all combinations of a grid from $-0.95$ to $0.95$ in steps of $0.1$ for the correlation and a grid from $0.1$ to $5$ in steps of $0.2$ for the variances. The data is then split $0.8/0.2$ into train and test sets. Some of the generated heatmaps are shown at the end of this section.
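The parameter grid and the split might be built as follows (a sketch, not the project's code; stopping the grids just short of their endpoints reproduces the $19 \times 25 \times 25 = 11{,}875$ count):

```python
import numpy as np
from itertools import product

# Parameter grids; np.arange stops short of the endpoint.
correlations = np.arange(-0.95, 0.95, 0.1)  # 19 values
variances = np.arange(0.1, 5.0, 0.2)        # 25 values

# All (var_x, var_y, rho) combinations.
params = list(product(variances, variances, correlations))

# 0.8 / 0.2 train-test split on shuffled indices.
rng = np.random.default_rng(0)
idx = rng.permutation(len(params))
cut = int(0.8 * len(params))
train_idx, test_idx = idx[:cut], idx[cut:]
```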

Data simulation

Data preprocessing and visualisation

Methodology

Following the exposition in Doersch (2016), the VAE aims to maximize the probability of each $X$ in the training set under the entire generative process, using a set of conditional distributions over the latent space. The probability of a given point $X$ can be written in terms of latent points $z$ using the law of total probability as,

$P(X)=\int P(X|z, \theta)P(z)\, dz$

The term $P(X|z, \theta)$ is replaced by a distribution $N(X| f(z, \theta), \sigma^2 I)$ with $f(\cdot)$ an arbitrary function. For the prior $P(z)$ the VAE assumes a standard multivariate normal distribution.

What remains is to find a way to handle the high-dimensional integral. One approach is to repeatedly sample $z$ and compute the expected value of $P(X|z, \theta)$. However, this would be wasteful, since for most $z$ we have $P(X|z) \approx 0$. A way around this is to introduce a new distribution $Q(z|X)$ that takes a value of $X$, finds the $z$ values that are likely to have generated the observed point, and computes the expected value under that distribution.
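The wastefulness of naive sampling can be seen in a small experiment. Here a fixed random linear map stands in for the decoder $f(z)$ (an illustration-only assumption; in the VAE $f$ is a neural network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical decoder f(z): a fixed random linear map into data space.
D, d = 5, 2
W = rng.normal(size=(D, d))
x = rng.normal(size=D)          # one observed data point
sigma = 0.1

def log_lik(x, z):
    """log N(x | f(z), sigma^2 I), dropping the normalising constant."""
    return -0.5 * np.sum((x - W @ z) ** 2) / sigma ** 2

# Naive Monte Carlo: draw z ~ P(z) = N(0, I) and evaluate the likelihood.
zs = rng.normal(size=(10_000, d))
lls = np.array([log_lik(x, z) for z in zs])

# Most draws contribute essentially nothing: only a small fraction of the
# samples lie within 10 nats of the best one.
frac_useful = np.mean(lls > lls.max() - 10.0)
```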

All that remains is to link the surrogate distribution $Q(z|X)$ with the original probability $P(X|z)$.

By using variational Bayesian methods it can be shown that the log probability of $X$ can be written in terms of the surrogate distribution $Q(z|X)$ as

$\log P(X) - D(Q(z|X)\,\|\,P(z|X)) = E_{z\sim Q}[\log P(X|z)] - D(Q(z|X)\,\|\,P(z))$

where $D(\cdot\,\|\,\cdot)$ denotes the Kullback-Leibler divergence.

Now, assuming the model has high capacity, the term $D(Q(z|X)\,\|\,P(z|X))$ should be close to $0$, and hence the log probability of $X$ is approximately equal to the right-hand side of the equation above.

Note that the first term on the right-hand side plays the part of a decoder, in the sense that it reconstructs the original inputs from their latent representations, while the second term is related to the encoder, since the distribution $Q(z|X)$ takes the original inputs and transforms them into their latent representations. The second term also acts as a penalty that pushes $Q(z|X)$ as close as possible to the prior. The problem becomes entirely tractable by specifying $Q(\cdot)$ as a Gaussian distribution too.
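With $Q(z|X)$ Gaussian, both terms on the right-hand side take simple forms: the KL penalty has a closed form and the reconstruction term reduces (up to constants) to a squared error. A minimal numpy sketch, with function names of my own choosing:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ): the encoder
    penalty that pulls Q(z|X) towards the prior."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def reconstruction_term(x, x_decoded):
    """Negative log-likelihood of X under N(f(z), sigma^2 I), up to constants:
    a plain squared error between input and reconstruction."""
    return 0.5 * np.sum((x - x_decoded) ** 2)

def vae_loss(x, x_decoded, mu, log_var):
    # Minimising this is (up to constants) maximising the right-hand side
    # of the variational bound.
    return reconstruction_term(x, x_decoded) + kl_to_standard_normal(mu, log_var)
```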

The VAE architecture can be summarised in the following schematic representation.

VAE.png

Image taken from Doersch, C. (2016)

Definition of the Variational Autoencoder architecture

Encoder architecture

The following architecture was chosen after experimenting with several possibilities, roughly following the guidelines in Leonov, Vasilyev, Makovetskii, & Kober (2019). I created a mini-block composed of a convolutional layer followed by a max-pooling layer, and the network is built by stacking these blocks. The strides and the number of units in each layer can be controlled for further flexibility.
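The mini-block can be illustrated from scratch in numpy; this sketches only the conv + max-pool unit (single channel, unit stride, fixed kernel), not the actual network code:

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Single-channel 'valid' convolution (really cross-correlation,
    as in most deep-learning libraries)."""
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(x, size=2):
    """Non-overlapping max pooling, dropping any ragged border."""
    H2, W2 = x.shape[0] // size, x.shape[1] // size
    x = x[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    return x.max(axis=(1, 3))

def mini_block(x, kernel):
    """Conv -> ReLU -> 2x2 max pool: the repeated unit of the encoder."""
    return max_pool2d(np.maximum(conv2d_valid(x, kernel), 0.0))
```

Applied to a 36x36 input with a 3x3 kernel, one block yields a 17x17 map, so a couple of blocks quickly shrink the image towards the latent dimension.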

In this particular case we are interested in keeping track of the test-set loss, but we also want a model that gives an interpretable embedding in the latent space. Training is done until there is no apparent improvement in the test-set loss.

I also used a scheduler for the step size in the optimizer, as this has been shown to improve results in certain difficult optimization problems.
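As an illustration, a simple step-decay schedule of the kind such schedulers implement (the exact schedule and its constants are assumptions, not the values used here):

```python
def step_decay(epoch, base_lr=1e-3, drop=0.5, every=10):
    """Hypothetical step-decay schedule: multiply the learning rate
    by `drop` every `every` epochs."""
    return base_lr * drop ** (epoch // every)
```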

Decoder architecture

Model

Training is done until there is no apparent improvement on the validation loss.

Example of decoded images

Reconstruction error on test set

In this section I show the average reconstruction error on the test set, measured with the Frobenius norm. The reported value is conditional on a single fixed sample of the latent codes.
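The metric can be sketched as follows (a hypothetical helper, not the project's code):

```python
import numpy as np

def mean_frobenius_error(originals, reconstructions):
    """Average Frobenius-norm distance between each test image and its
    reconstruction."""
    return np.mean([np.linalg.norm(a - b, ord='fro')
                    for a, b in zip(originals, reconstructions)])
```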

Results

In this section I present the results obtained by exploring the latent space that gets formed by the use of the VAE.

2D

The images above show the projections onto the Cartesian axes of the manifold obtained in the latent space. Colors represent how the three characteristics of the covariance matrix change as we move through the space.

The separation by correlation is evident when projecting onto the $x$-$z$ plane. Directions to the south are related to lower correlations, while those pointing north are related to higher correlations. The variances of the components are also separated, less sharply, by the bisection of the projected figure.

The projection onto the $x$-$y$ plane shows a sharper separation of the points according to the variances of the first and second components towards the edges of the figure.

2D Directions

Useful functions

Lines on 0-2 axis

From the figure above we can get a sense of how the heatmaps rotate as the correlation is modified. The chosen direction evidently captures this aspect of the data.

Lines on 0-1 axis

A horizontal line trajectory is shown in the projected $x$-$y$ space. In this case the direction represents changes in the variances of the components: it starts with a wide ellipse with higher variance in the $x$-component and slowly transforms into a longer ellipse with increasing variance in the $y$-component.

Angular Directions

The figure above shows a particular angular direction in the $x-z$ projected space. We can see how a rotation in the latent space is reflected in a similar way in the decoded heatmaps.

3D

Projections onto the coordinate axes

The plots above show the $2D$ projections from the previous section in the original latent space.

3D Space

Plot of the 3D manifold colored by latent characteristics.

Interactive Plots

Plots above show the manifold generated by the VAE in the latent space. The figures are colored by both the correlation and the difference in the variance between the components.

We note that the height of a point on the $z$ axis is directly related to the correlation. A simple trajectory that moves from north to south spans the complete range of values for the correlation parameter.

Coloring by the difference between the variances, we note that points representing heatmaps with relatively large differences between the components are located on the sides of the manifold, towards the exterior. The central part of the manifold represents images where both components have similar variance. This is confirmed by the 2D projections shown above. Points on the east of the first axis have a relatively higher variance in the first component, while those on the west have a higher variance in the second.

Exploration based on specific directions on 3D

In this section I explore the two directions that seem to have a more straightforward interpretation in terms of encoded characteristics of the input images.

The first direction is obtained by traversing the convex hull of the manifold along its centre, which has a thin line of points representing images with equal variances in both components (see the blue line in the plot above).

The second is obtained by finding the planes that bound the manifold in the $y$-$z$ directions, both from above and below (with $x$ constant).

Convex Hull Isovariance

To find the convex hull of the manifold along the isovariance strip, I filter the data by the difference in variances, bin it for different values of $z$, and fit splines linking both the $x$-$z$ and the $y$-$z$ axes.
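The procedure just described can be sketched as follows, with binned means standing in for the spline fit (a simplification of the actual analysis; all names are mine):

```python
import numpy as np

def isovariance_curve(points, var_diff, tol=0.1, n_bins=10):
    """Trace the isovariance strip of the manifold: keep points whose two
    variances are (nearly) equal, bin them by their z coordinate, and average
    x and y per bin. The binned means stand in for the splines used in the
    analysis."""
    pts = points[np.abs(var_diff) < tol]  # filter: var_x ~ var_y
    z = pts[:, 2]
    bins = np.linspace(z.min(), z.max(), n_bins + 1)
    which = np.clip(np.digitize(z, bins) - 1, 0, n_bins - 1)
    centres, xs, ys = [], [], []
    for b in range(n_bins):
        m = which == b
        if m.any():
            centres.append(z[m].mean())
            xs.append(pts[m, 0].mean())
            ys.append(pts[m, 1].mean())
    return np.array(centres), np.array(xs), np.array(ys)
```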

The plot above shows the sample of points that were reconstructed through the decoder, and the reconstructed images are shown as well. It is hard to be completely accurate about the correct path to traverse in order to keep the variance fixed; however, we can get a sense that this direction of the manifold is related to changes in correlation. As can be seen in the plots, the correlation passes from its most negative value to its most positive one. The movement is easy to check in the decoded images, as this feature is related to the rotation of the ellipse.

Bounding planes - Diminishing variance

For this section I create a set of planes that bound the manifold from above and from below in the $y$-$z$ direction.

We can see that in the direction of the top bounding plane, the variance of the first component and the correlation remain fixed while the variance of the second component decreases. This translates into a progressive decrease in the width of the ellipse, passing from an almost circular shape to a very sharp figure.

The bounding plane from below shows a similar pattern: the variance of the second component increases progressively as we traverse the plane in a northbound direction, while the (negative) correlation stays fixed. This is similar to the previous case, which had a positive correlation.

Comparing with a traditional autoencoder

In this section I estimate a traditional autoencoder with an architecture similar to the VAE in order to compare the latent spaces generated in the two cases.

Definition of autoencoder

Encoder architecture

Decoder architecture

Model

Reconstruction error on test set

This is the reconstruction error on the test set for the autoencoder. As we can see, results are on average better than those obtained with the VAE; however, the distribution is heavy-tailed.

The heavy-tailedness shows that there is a set of images in the test set for which the model doesn't fit well; in these cases, the decoded images are very different from the original ones.

Example of decoded images

Exploration of latent space for the Autoencoder

2D

The plots above show projections onto the Cartesian axes of the latent space generated by the AE. As with the VAE, correlation is the characteristic that is easiest to infer in the latent space. The projections also show that, in contrast to the VAE, the latent space of the AE is less dense and has more gaps between points.

Lines on 0-2 axis

When trying to traverse the projection space on the 0-2 axis, some problems emerge. Due to the many gaps and the lack of density, the reconstructed images don't look similar to the original data: while the direction encodes a certain form of rotation, the images shown are not ellipses like the original data.

Lines on 0-1 axis

Again, the images rotate along the trajectory, but the shapes are very different from the original input data. The results suggest that the ability of the AE to generate new samples from empty regions of the latent space is poor compared with the previous results obtained with the VAE.

3D

Interactive Plots

Convex Hull

It is important to note that, similar to the VAE results, the convex hull of the manifold created by the AE also has a thin isovariance line. In general the shape of the figure is very similar in both cases, except for the density of the points. Another difference is the use of parabola-shaped curves to encode a few particular points in the AE case.

In this section I take some points on the convex hull and generate some random perturbations to check how these affect the decoded result.

We note that near the convex hull the space is denser, so we don't see the type of problems we had when traversing the projection space; there are not many empty spaces along the chosen trajectories, and the decoded results are still similar to the original data.

Transversing empty space

Analyzing the manifold, we can see that it has some parabola-shaped tails. For this section I decided to traverse the completely empty space that joins the left tail and the right one through the back of the manifold.

We see that while the direction still has a meaning, namely increasing variance in the second component versus decreasing variance in the first, the reconstructed images look very different from the original ones.

General Discussion of results

Directions of the latent space for the VAE

Shape of the manifold that is created in the latent space

Differences that arise when comparing with an Autoencoder

Extension to a four dimensional problem

Data generation

In this part I extend the previous results by considering data that has four inherent characteristics instead of three. Projections are still done onto a 3-dimensional latent space. Instead of considering only the density of a bivariate normal distribution, for this section I consider heatmaps generated by multivariate t-distributions with $2$ or $200$ degrees of freedom. The degrees of freedom are assigned according to the following stochastic rule: $|x + y|^2 - l > 20$, where $l$ is drawn from a $U(0,10)$ distribution and $x, y$ are the variances of the first and second components respectively.
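The assignment rule can be sketched as follows; note that the text only states the thresholding inequality, so which branch receives $2$ degrees of freedom is an assumption here:

```python
import numpy as np

rng = np.random.default_rng(0)

def assign_df(var_x, var_y):
    """Stochastic degrees-of-freedom assignment: df = 2 when
    |x + y|^2 - l > 20 with l ~ U(0, 10), and df = 200 otherwise.
    (Which branch gets df = 2 is my assumption.)"""
    l = rng.uniform(0, 10)
    return 2 if abs(var_x + var_y) ** 2 - l > 20 else 200
```

Because $l \le 10$, large variance pairs always trigger the first branch and small ones never do, so the fourth trait is almost a deterministic function of the variances.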

The rest of the data generating process is carried out in the same manner as in the section before.

Data Preprocessing and validation

The plot above shows a sample of the generated images colored by whether they were created by the t-distribution with 2 degrees of freedom or 200.

Variational autoencoder architecture

The architecture used in this part resembles the one used in the section before.

Interactive plot

As we can see in the figure above, the generated manifold is similar to the one obtained before, except that the new one has a more oval-shaped border instead of a conic one.

The information on the degrees of freedom appears to be stored through layers of parabolic 2D figures over the 3D manifold.

KNN classifier

After inspecting the generated manifold, I estimated a KNN classifier to check whether a good separation is achieved in the latent space between points coming from the two different distributions.
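A minimal from-scratch KNN on the latent points might look like this (any standard implementation would do; the names are mine, and labels are assumed to be small non-negative integers):

```python
import numpy as np

def knn_predict(train_pts, train_labels, query_pts, k=5):
    """k-nearest-neighbour classification by majority vote."""
    # Pairwise squared distances between queries and training points.
    d2 = ((query_pts[:, None, :] - train_pts[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = train_labels[nearest]
    # Majority vote among the k nearest neighbours.
    return np.array([np.bincount(v).argmax() for v in votes])
```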

Autoencoder architecture

A similar analysis is carried out for an autoencoder architecture in this section.

Interactive plot

KNN classifier

General Discussion of results

For this last part, images with four inherent dimensions were created and embedded in a 3D latent space using both the VAE and the AE, in order to compare the results. The problem is tractable because the fourth dimension depends almost completely on the others, namely on the sum of the variances plus an error. Hence, what we are in fact looking for is whether the methods can discover this pattern.

Conclusion

An exploration of the latent space induced by a variational autoencoder is carried out for a particular data example and compared with the results obtained with a traditional autoencoder. The data is created by evaluating the density of a bivariate normal distribution on a $36\times36$ grid between $-3$ and $3$ for different covariance matrices and presenting the results as heatmaps.

More than eleven thousand images are generated with varying degrees of correlation and variance in both components according to a fixed grid of parameters. The data is split in a train and test set.

The architecture of the VAE follows basic principles in the image recognition literature using convolutional, transpose convolutional, max pooling and fully connected layers. Training is done until there is no apparent improvement in validation loss and the optimizer uses a learning rate scheduler.

Results show that directions in the latent space of the VAE become meaningful, as they reflect variations in the three latent characteristics of the images, namely the variances of the two components and their correlation. In particular, we find three trajectories that are monotone in these latent traits. The manifold formed by the VAE to embed the original data isn't arbitrary: its shape, though mostly smooth, has a sharp kink reflecting the lack of injectivity in the mapping from covariance matrices to heatmaps when the variance is maxed out (at this point, modifications of the correlation barely change the image).

When comparing the manifolds of the two methods, we note that overall their shapes are very similar. However, the latent space generated by the autoencoder is much sparser and has empty regions between points. These regions become problematic when there is interest in decoding points inside them, as the results obtained are usually very different from the original data; the problem is more pervasive when trying to decode points obtained through projections in the latent space. The VAE doesn't have this problem. The reconstruction error of both methods shows that on average the reconstruction is better with the autoencoder; however, the distribution of the error is skewed, again showing the high-magnitude mistakes that occur when decoding points in empty regions of the space.

Extending the problem to data with four inherent characteristics, we note that both methods can act as non-linear dimensionality-reduction methods. The information about the fourth trait (in this case, membership in a particular class) is kept in the latent-space representation. In particular, a classifier based on the latent-space points achieves similar accuracy and F1 score for both methods.

It would be interesting to extend this project by considering refined metrics to compare decoded images against the original ones, instead of the Frobenius distance; metrics invariant to translations and rotations could prove much more useful. Further research could address how to compare images generated by decoding random points in the latent space, on the grounds of whether they are similar to what we expect to obtain and whether they can be said to come from the same distribution as the original data.

References

Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

Klys, J., Snell, J., & Zemel, R. (2018). Learning latent subspaces in variational autoencoders. arXiv preprint arXiv:1812.06190.

Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

Leonov, S., Vasilyev, A., Makovetskii, A., & Kober, V. (2019). Analysis of the convolutional neural network architectures in image classification problems. In Applications of Digital Image Processing XLII (Vol. 11137, p. 111372E). International Society for Optics and Photonics.